MadPy 2024-08-08
I use RStudio as my preferred IDE, and this talk was written from that perspective.
You can just as easily do this all in VSCode. But please…
Git integration: versioning nearly always breaks because git version control works on a line-by-line basis. Why not VSCode? Because I don’t want to.
I like (and more importantly, know) RStudio. I also don’t use Python enough to bother learning VSCode.
There are utilities to install and manage Python packages and environments in RStudio2 but I don’t use them.
I use mamba3 for package management, which works pretty much just like conda, except it solves the environment much more quickly because it uses the libsolv C library instead of Python to resolve dependencies.
Go to Tools -> Global Options
Enter the path to the python interpreter in your base environment
reticulate4 is an RStudio (Posit) package containing tools for interoperability between Python and R.
These tools are not intended for standalone Python work but rather explicitly aimed at the integration of Python into R projects (and as such are closely tied to the reticulate package).
They “strongly suggest” using one of the IDEs available for doing data science in Python for Python-only projects.6
Allows calling Python from R in multiple ways, including from RMarkdown/Quarto, sourcing Python scripts, importing Python modules, and using Python interactively within an R session
Translates between R and Python objects (e.g., between R and Pandas data frames, or between R matrices and NumPy arrays)
The first thing you always want to do is set your environment:
Do this whether you’re working interactively, calling scripts, or authoring reports.
There is also use_python(), which lets you specify a Python interpreter other than the one you set in your global options, and use_virtualenv(), which sets a virtual environment instead of a conda environment.
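As a minimal sketch (the environment name `r-reticulate` here is just an assumption; substitute your own):

```r
library(reticulate)

# Activate a conda environment by name (hypothetical name: "r-reticulate");
# required = TRUE errors immediately if the environment can't be found
use_condaenv("r-reticulate", required = TRUE)

# Alternatives:
# use_python("/usr/local/bin/python3", required = TRUE)    # a specific interpreter
# use_virtualenv("~/.virtualenvs/myenv", required = TRUE)  # a virtualenv instead
```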
Provides flexible binding to different versions of Python including virtual environments and Conda environments
The py object: when you call library(reticulate), it creates the py object in the reticulate package environment.
It is the bridge between R and Python, through which you can run Python code and interact with Python objects.
The most common way you will interact with it is to access any Python object from the R environment using the $ operator, e.g., py$x.
Important
Always call library(reticulate) or you won’t be able to access the py object!
reticulate::import() can be used to import any installed Python module into your R environment.
Then you can call any function from that module in R using $.
Let’s say I have a Python script that defines a function:
If I’d like to use that function in R, I can source it using source_python().
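For example (the file name `add.py` and its function are hypothetical stand-ins):

```r
library(reticulate)

# Suppose add.py contains:
#   def add(x, y):
#       return x + y
source_python("add.py")

# The Python function is now bound in the R environment and callable directly
add(3, 4)
```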
Let’s say my collaborator wrote a Python script for processing some raw data. I’d like to work with the processed data in R, but my collaborator only provided me with the raw data and the script.
I can use py_run_string() and py_run_file() to process the data, and then access any objects created in the Python main module using the py object exported by reticulate:
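A sketch of that workflow (the script name `process_raw_data.py` and the object names `processed` and `n_rows` are hypothetical):

```r
library(reticulate)

# Run the collaborator's script; any top-level objects it creates
# live in Python's __main__ module afterwards
py_run_file("process_raw_data.py")

# One-off snippets work the same way
py_run_string("n_rows = len(processed)")

# Access the results from R via the py object
head(py$processed)
py$n_rows
```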
reticulate includes a Python engine for RMarkdown, and knitr v1.18 and higher uses this engine by default.
Set your environment in your setup chunk:
Important
Always call library(reticulate) or you won’t be able to access the py object!
Then you can start inserting Python chunks just like you would R chunks, and knitr will knit everything together:
Just like when working interactively, you can access objects created in Python chunks in R by using the py object:
And you can access objects created in R chunks in Python by using the r object:
Quarto provides all the support for Python that RMarkdown does, plus support for Jupyter:
Set jupyter: python3 in your YAML header and make sure the paths to Python and Jupyter are in your PATH. You can also specify a kernelspec in your YAML:

```yaml
---
title: "My Document"
jupyter:
  kernelspec:
    name: xpython
    language: "python"
    display_name: "Python 3.7 (XPython)"
---
```

There is other support for Python from the Quarto CLI as well, plus a VSCode Quarto plugin.8
When calling into Python, R data types are automatically converted to their equivalent Python types.
When values are returned from Python to R they are converted back to R types.9
| R | Python | Examples |
|---|---|---|
| Single-element vector | Scalar | 1, 1L, TRUE, "foo" |
| Multi-element vector | List | c(1.0, 2.0, 3.0), c(1L, 2L, 3L) |
| List of multiple types | Tuple | list(1L, TRUE, "foo") |
| Named list | Dict | list(a = 1L, b = 2.0), dict(x = x_data) |
| Matrix/Array | NumPy ndarray | matrix(c(1,2,3,4), nrow = 2, ncol = 2) |
| Data Frame | Pandas DataFrame | data.frame(x = c(1,2,3), y = c("a", "b", "c")) |
| Function | Python function | function(x) x + 1 |
| Raw | Python bytearray | as.raw(c(1:10)) |
| NULL, TRUE, FALSE | None, True, False | NULL, TRUE, FALSE |
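The Python side of the table above, written out as plain Python values for reference:

```python
# Python-side view of the conversion table
scalar = 1.0                   # R single-element vector -> scalar
lst = [1.0, 2.0, 3.0]          # R multi-element vector -> list
tup = (1, True, "foo")         # R list of multiple types -> tuple
dct = {"a": 1, "b": 2.0}       # R named list -> dict
raw = bytearray(range(1, 11))  # R as.raw(c(1:10)) -> bytearray of 10 bytes
nothing = None                 # R NULL -> None

print(type(lst).__name__, type(tup).__name__, type(dct).__name__)
```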
The automatic conversion between R types and Python types works well in most cases, but sometimes you might want more control over the conversions.
If you’d like to work directly with Python objects by default you can pass convert = FALSE to the import() function.
Then when you’re done working with the object in Python, you can convert it to an R object explicitly with py_to_r().
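A sketch of that pattern with NumPy (assuming NumPy is installed in the active environment):

```r
library(reticulate)

# Import NumPy but keep results as Python objects (no automatic conversion)
np <- import("numpy", convert = FALSE)

x <- np$array(c(1, 2, 3))   # x is a Python object, not an R vector
y <- np$sqrt(x)             # computation stays on the Python side

# Convert explicitly once you're done working in Python
py_to_r(y)
```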
Numeric types are different between R and Python. For example, 42 in R is a float, while in Python it’s an integer.
If you want to explicitly define a number as an integer in R so that it’s passed as such to Python, use the L suffix:
If a Python API requires a list but you’re only passing it a single element, you can wrap it in base list():
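A pure-Python illustration of why both points matter (the `total` function is a hypothetical stand-in for an API that expects a list):

```python
# Many Python APIs are strict about types. range() accepts ints, not floats:
try:
    range(3.0)
    got_typeerror = False
except TypeError:
    got_typeerror = True  # 3.0 is rejected; R's default numeric would hit this

# And an API expecting a list of values fails on a bare scalar:
def total(values):        # hypothetical API that iterates over its argument
    return sum(v for v in values)

result = total([5])       # a single value, wrapped in a list, works
```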
Python uses 0-based indices for collections:
while R uses 1-based indices:
Note
Notice the need to explicitly use an integer when slicing the Python object
Python indices are non-inclusive for the end range, while R indices are:
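Both differences in one small Python snippet:

```python
x = [10, 20, 30, 40]

first = x[0]   # 0-based: this is the FIRST element (R would use x[1])
sub = x[1:3]   # end-exclusive: elements at indices 1 and 2 only
               # (R's x[2:3] is inclusive on both ends)
```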
R and Python represent arrays and matrices in memory differently:
The most important thing to remember about this is that
R and Python print arrays differently.
reticulate supports the conversion of sparse matrices created by the Matrix R package to and from SciPy CSC matrices.11
I tried to make an example but working out the dependencies for scipy.sparse was way too much work.
https://rstudio.github.io/reticulate/articles/python_dependencies.html may have been helpful but I didn’t care enough.
As mentioned earlier, R data frames can be automatically converted to and from Pandas data frames. By default, columns are converted using the same rules governing R array <=> NumPy array conversion, with a couple extensions:
R date-time (POSIXct) columns are converted to datetime64[ns] columns.
If the R data frame has row names, the generated Pandas data frame will be re-indexed using those row names, and vice versa.
If a Pandas data frame has a DatetimeIndex, it is converted to character vectors, as R only supports character row names.
Pandas out of the box handles NAs differently than R:
Pandas does have experimental support for nullable data types (represented by pd.NA), but you have to enable it first:
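A stdlib-only sketch of why this matters: the NaN that Pandas uses by default is a float with odd comparison semantics, quite unlike R's NA, which propagates as "unknown":

```python
import math

nan = float("nan")

# NaN compares unequal to everything, including itself,
# whereas in R, NA == NA evaluates to NA, not FALSE
self_equal = (nan == nan)     # False
detected = math.isnan(nan)    # the reliable way to test for NaN

# None is Python's "no value"; it is not a float and not NaN
none_is_nan = isinstance(None, float)
```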
For advanced Python users, there’s more documentation on
Check out https://rstudio.github.io/reticulate/articles/calling_python.html for deets.
You can print documentation on any Python object using py_help():
This will open a text document outside of RStudio:
Help on built-in function chdir in module nt:
chdir(path)
Change the current working directory to the specified path.
path may always be specified as a string.
On some platforms, path may also be specified as an open file descriptor.
If this functionality is unavailable, using it raises an exception.
There is also this excellent article aimed at R users who are new to Python.
Remember this 2x2x2 array?
These two printouts show the exact same array, so why do they look different?
Python groups by the first index when printing, while R groups by the last index:
In the previous example I created an array in Python and ported it to R. What about the other way around?
The NumPy array will be created using column-major ordering:
C_CONTIGUOUS : False
F_CONTIGUOUS : True
OWNDATA : True
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
Remember:
F for “FORTRAN” (column-major order); C for “C” (row-major order). You can always create NumPy arrays in column-major order by passing the "F" flag:
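The difference between the two orders is just index arithmetic. A pure-Python sketch (no NumPy needed) of how a multi-dimensional index maps to a flat memory offset under each convention:

```python
def flat_index(idx, shape, order="C"):
    """Map a multi-dimensional index to a flat memory offset."""
    if order == "C":          # row-major: the LAST index varies fastest
        offset, stride = 0, 1
        for i, n in zip(reversed(idx), reversed(shape)):
            offset += i * stride
            stride *= n
    else:                     # "F", column-major: the FIRST index varies fastest
        offset, stride = 0, 1
        for i, n in zip(idx, shape):
            offset += i * stride
            stride *= n
    return offset

# In a 2x2x2 array, stepping the first index moves...
c_step = flat_index((1, 0, 0), (2, 2, 2), "C")  # 4 slots in C order
f_step = flat_index((1, 0, 0), (2, 2, 2), "F")  # 1 slot in FORTRAN order
```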
You can rearrange R arrays into row-major order, but it’s gross.
The R dim() function is used to reshape arrays in R. This works by changing the dim attribute of the array, effectively re-interpreting the array indices using column-major semantics.
To overcome this, use reticulate::array_reshape() to reshape R arrays using row-major semantics.
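To see what "row-major semantics" versus "column-major semantics" means when reshaping, here is a pure-Python sketch that fills a 2x3 matrix from a flat sequence both ways (column-major, "F", is what R's dim() does):

```python
def reshape_2d(data, nrow, ncol, order="C"):
    """Fill an nrow x ncol matrix from a flat sequence."""
    m = [[None] * ncol for _ in range(nrow)]
    if order == "C":                  # row-major: fill across each row first
        for k, v in enumerate(data):
            m[k // ncol][k % ncol] = v
    else:                             # "F": column-major, fill down each column
        for k, v in enumerate(data):
            m[k % nrow][k // nrow] = v
    return m

row_major = reshape_2d([1, 2, 3, 4, 5, 6], 2, 3, "C")  # [[1, 2, 3], [4, 5, 6]]
col_major = reshape_2d([1, 2, 3, 4, 5, 6], 2, 3, "F")  # [[1, 3, 5], [2, 4, 6]]
```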
https://lilicoding.github.io/papers/wang2020assessing.pdf
https://rstudio.github.io/reticulate/articles/python_packages.html
https://mamba.readthedocs.io/en/latest/index.html
https://rstudio.github.io/reticulate/
https://rstudio.github.io/reticulate/articles/calling_python.html
https://rstudio.github.io/reticulate/articles/rstudio_ide.html
https://rstudio.github.io/reticulate/articles/r_markdown.html
https://quarto.org/docs/computations/python.html
https://rstudio.github.io/reticulate/articles/calling_python.html#type-conversions
https://rstudio.github.io/reticulate/articles/arrays.html
https://rstudio.github.io/reticulate/articles/calling_python.html#sparse-matrices